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Abstract 

We  introduce  Time-Sensitive  Dirichlet  Process  Mixture  models  for  clustering.  The 
models  allow  infinite  mixture  components  just  like  standard  Dirichlet  process  mixture 
models.  However  they  also  have  the  ability  to  model  time  correlations  between  in¬ 
stances. 
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1  Introduction 


Traditional  clustering  algorithms  make  two  assumptions  that  are  often  false  in  practice: 
1.  that  the  number  of  clusters  is  known;  2.  that  the  data  points  are  independent.  We 
propose  a  model  that  allows  infinite  number  of  clusters,  and  cluster  members  may  have 
certain  dependency  in  time. 

Consider  emails  received  by  a  user  over  a  period  of  time.  Suppose  we  want  to 
cluster  the  emails  by  topic  thread.  There  are  several  ways  to  do  this: 

•  We  can  sort  emails  by  the  ‘subject’  line.  However  it  is  unreliable  and  we  want  a 
more  flexible  probabilistic  model  based  on  email  content. 

•  We  can  model  each  thread  with  a  multinomial  distribution  over  the  vocabulary, 
and  treat  each  email  as  a  ‘bag  of  words’.  The  whole  email  collection  can  be 
modeled  as  a  mixture  of  multinomial.  The  problem  is  that  we  do  not  know  the 
number  of  threads  (mixing  components).  Fixing  the  number,  which  is  a  common 
practice,  seems  arbitrary. 

•  We  can  model  the  collection  as  a  Dirichlet  process  mixture  model  (DPM)  [1], 
DPMs  allow  potentially  infinite  number  of  components.  Nonetheless  DPMs  are 
exchangeable.  When  applied  to  emails,  this  means  that  old  threads  never  die 
down.  This  is  undesirable  because  we  want  the  emails  from  years  ago  to  have 
less  influence  than  those  from  this  morning  in  predicting  the  next  email. 

We  therefore  would  like  to  introduce  the  concept  of  time  into  DPMs,  while  keeping 
the  ability  to  model  unlimited  number  of  clusters.  This  is  achieved  with  the  proposed 
Time-Sensitive  Dirichlet  Process  Mixture  (tDPM)  models. 

2  The  tDPM  Framework 

Consider  a  sequence  of  input  d  with  time  stamp  t:  (di,ti), . . . ,  ( dn,tn ),  where  the  time 
monotonically  increases.  For  concreteness  let  us  assume  the  d’s  are  email  documents, 
each  represented  as  a  bag-of-word  vector.  Let  s,  £  {1,2, ...}  be  the  true  cluster 
membership  (email  thread)  of  di.  Notice  we  do  not  set  the  number  of  clusters  a  priori. 
There  could  potentially  be  an  unlimited  number  of  clusters  as  the  number  of  documents 
n  grows. 

Without  loss  of  generality  we  assume  that  each  cluster  j  is  represented  by  a  multi¬ 
nomial  distribution  0:J  over  the  vocabulary.  The  probability  for  cluster  j  to  generate 
document  di  is  then 

P{di\ej)=  n  e3{v)di(v)  (1) 

tievocabulary 

Since  past  email  threads  can  influence  the  current  email,  we  want  s,  to  depend  on 
the  history  si, . . . ,  Sj-i.  We  also  want  such  dependency  to  vary  with  time:  older  emails 
should  have  less  influence.  We  introduce  a  weight  function  w(t,j)  which  summarizes 


1 


t  t 

(a)  (b) 


Figure  1:  (a)  The  time  kernel  with  A  =  0.5.  (b)  The  weight  functions  with  data  from 
two  clusters,  marked  as  star  or  circle  respectively. 


the  history  at  time  t.  It  gives  the  weight  (or  ‘influence’)  of  cluster  j  at  time  t,  given  the 
history  so  far  s1; . . . ,  :  t  j  <  t, 

w(t,j)=  ^2  k(t-ti )  (2) 

{i\ti<t,Si=j} 

Note  the  weight  function  is  the  sum  of  some  time  kernel  k.  In  the  email  example  we 
can  use  a  kernel  like  k(t)  =  exp(— Xt)  if  t  >  0,  and  k{t)  =  0  if  t  <  0.  This  kernel 
stipulates  that  an  email  will  boost  the  probability  of  the  same  thread  in  later  emails,  but 
the  boost  decreases  exponentially  as  specified  by  the  parameter  A.  Figure  1(a)  shows 
an  example  time  kernel  with  A  =  0.5,  while  (b)  shows  two  weight  functions  built  upon 
the  kernel.  In  the  example  there  are  documents  from  cluster  1  at  time  0,2,6,  and  from 
cluster  2  at  time  3,4.  Other  forms  of  the  time  kernel  are  possible  too. 

We  define  the  prior  probability  of  assigning  clustery  to  d, ,  given  the  history  Si, . . . ,  Sj_i, 
to  be 

P(s%  =y|si,...,s;_i)  (3) 

=  P(si  =  j\w(ti,  •))  (4) 

=  (  x/J(u]')+«  ifj  is  in  history  (5) 

I  t' - n  —  if  j  is  new 

l  Zdj>  ')+ot  J 

where  a  is  a  concentration  parameter.  We  call  this  a  time-sensitive  Dirichlet  process 
mixture  (tDPM)  model.  Intuitively  if  there  has  been  many  recent  emails  from  cluster 
j,  the  new  email  will  have  a  large  probability  also  from  j.  In  addition,  there  is  always 
a  possibility  that  the  new  email  is  from  a  new  cluster  not  seen  so  far. 

tDPM  is  very  similar  to  the  standard  Dirichlet  process  mixture  (DPM)  models. 

In  fact,  it  can  be  shown  that  if  the  time  kernel  A:  is  a  step  function,  then  we  recover 
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Figure  2:  The  graphical  model  for  Time-sensitive  Dirichlet  Process  Mixture  models,  d 
is  the  feature  (e.g.  words  of  an  email),  t  is  the  time  stamp,  s  is  the  cluster  label,  and  w 
is  the  sufficient  statistic  that  summarizes  the  history.  Shaded  nodes  are  observed. 


the  standard  DPMs.  It  is  the  decaying  of  k  over  time  that  allows  us  to  include  time 
information  in  to  the  process.  The  graphical  model  representation  of  tDPM  is  given  in 
Figure  2. 

3  Inference 

Given  d  and  t,  we  would  like  to  infer  s.  We  use  a  Markov  Chain  Monte  Carlo  method. 
Notice  w  is  a  deterministic  function  of  s  and  t  and  does  not  need  to  be  sampled.  As 
shown  later  if  we  used  conjugate  priors,  we  do  not  need  to  actually  sample  9  but  can 
analytically  integrate  it  out.  Therefore  we  only  need  to  sample  s. 

In  Gibbs  sampling,  we  need  to  sample  s,  from  the  distribution 

P(Si  j  |S— j,  d\ ,  .  .  .  ,  dji)  OC  P(Sz  — j  |-S_ i)P{di \d—i:s_i=j )  (6) 

where  d_j:s_i=J  is  the  set  of  documents  in  cluster  Sj  =  j,  excluding  d, . 

The  prior  P(s,  =  j|s_j)  in  (6)  involves  all  nodes  before  and  after 

P(st  =j\S-i) 

i-t  \ 

1PJ  *  *  *  5  Sm—  l)  |  P($i  =  jl^l} 

m= 1  / 

/  n 

OC  P(Si=j\si,...,8i-i)\  ]^[  P(sm|si,. 

\m=i-\- 1 

Substituting  in  the  definition  (5),  it  is  easy  to  show  that  the  denominators  are  the  same 
for  different  values  of  j,  and  the  only  difference  is  in  the  numerator. 


•  n  p(  •  •  •  5  Sm— l)  j 

\m=i+ 1  / 

■  •  •  >  sm—  l)  ]  (7) 
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The  likelihood  term  p(di\d-i-.s_i-j)  in  (6)  is  domain-specific.  For  the  email  task, 
a  Dirichlet-multinomial  [2]  is  the  natural  choice: 

p{di\d-i,s_i=j)  =  J  p(di\9)p(9\d-i:s_i=j)d9  (8) 


where  p{9\d-i:s_i—j)  is  a  posterior  Dirichlet  distribution.  The  posterior  is  derived 
from  a  prior  (base)  Dirichlet  distribution  Go,  and  the  observed  data  d_i:S  i—j.  Let  the 
Dirichlet  prior  Go  be  parameterized  by  /3m,  where  m  is  a  vector  over  the  vocabulary 
and  m  sums  to  1,  and  /3  is  the  strength  of  the  prior: 


p(Q\(3m) 


T'(fi)  I  I  n0mv  —  l 

Hu  r(/3m„)  ” 


(9) 


Treating  the  document  collection  d-i:s_i=j  as  a  single,  large  document,  the  Dirichlet 
posterior  after  observing  counts  fv  for  word  v  in  d^i:s_i-j  is 

i.m  =  <10> 

And  the  Dirichlet-multinomial  is 

P(dt\d_i,s_i=j)  =  j  p{dl\9)p(9\d_i..s_i=j)d9  (11) 

r(E»  fv  +  P)  IL  rK(E  +  fv  +  /3 mv) 

n„  T(fv  +  Pmv)  T(J2V  di{v)  +  fv  +  P) 

Putting  everything  together  for  (6),  we  can  fix  all  other  s  and  sample  for  S{.  A  single 
Gibbs  sampling  iteration  consists  of  looping  through  i  =  1 . . .  n  and  sample  .st  in  turn. 
The  algorithm  is  given  in  Figure  3.  The  time  complexity  is  0(n2)  for  each  iteration 
of  the  Gibbs  sampler.  If  k  has  limited  support,  the  complexity  reduces  O(n)  but  we 
lose  the  ability  to  model  long  range  correlations.  Finally  we  run  the  Gibbs  sampler  for 
many  iterations  to  get  the  marginals  on  s. 

Some  readers  may  be  disturbed  by  the  apparent  ‘double  counting’  in  Figure  3  when 
we  assign  u{c)  =  a  to  not  only  the  brand  new  state  cnew,  but  also  to  states  not  in  {s<j} 
but  in  {s>j}.  We  assure  the  readers  that  it  is  merely  an  artifact  of  numbering.  If  we 
were  to  renumber  the  states  at  each  iteration,  we  can  recover  (5)  exactly. 


4  Parameter  Learning 

The  parameters  of  the  model  include  the  base  Dirichlet  distribution  Go,  the  concentra¬ 
tion  parameter  a,  and  the  time  kernel  parameter  A.  We  fix  the  base  Dirichlet  Go-  For 
the  time  being  let  us  assume  that  all  clusters  share  the  same  kernel  parameter  A.  The 
free  parameters  are  0  =  {a,  A}. 

We  learn  the  parameters  by  evidence  maximization.  Since  our  model  is  conditioned 
on  time,  the  evidence  is  defined  as 

P{D\T,  P(D\S)P{S\T,  0)  (13) 

s 
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for  position  i  =  1  to  n 

/*  C  is  the  candidate  states  for  Si,  */ 

/*  where  {s_,}  is  the  set  of  current  states  at  positions  other  than  i,  */ 

/*  and  cnew  {s-i}  is  a  new  state,  represented  by  an  arbitrary  new  number.  */ 
C  =  {s-i}  U  {cnew} 


/*  Compute  the  unnormalized  probability  p(si  =  c|s_j)  for  all  candidate  c  */ 
for  c  €  C 

/*  evaluate  candidate  =  c  */ 

Si  <—  c 

/*  Prior:  the  history  part.  {s<i}  is  the  set  of  states  before  position  i  */ 
if  c  e  {s<i}  then  u(c )  =  wc{U) 
else  u{c)  =  a 

/*  Prior:  the  future  part.  */ 
for  j  =  i  +  1  to  n 

if  Sj  G  {s<j}  then  u{c)  =  u(c)  *  wSj  ( tj ) 
else  u(c)  =  u(c)  *  a 

end 

/*  Likelihood.  */ 

u(c)  =  u(c)  *  P{di\d_i,s_i=c) 

end 

/*  pick  the  state  Sj  with  probability  proportional  to  u()  */ 

Si  ~  u(C) 


Figure  3:  A  single  Gibbs  sampling  iteration  for  tDPM 
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where  D  is  the  set  of  all  documents,  T  is  the  corresponding  set  of  time  stamps,  and  S 
is  the  set  of  cluster  assignments.  We  want  to  find  the  best  parameters  0*  that  maximize 
the  evidence: 


0*  =  arg  max  P(.D | T,  0)  (14) 

=  arg  max  ^  P(D\S)P(S\T,  0)  (15) 

0  s 

We  find  the  parameters  with  a  stochastic  EM  algorithm.  The  cluster  labels  S  are 
hidden  variables.  Let  0o  be  the  current  parameters.  We  can  sample  ,S’rlj  . . .  S<M>  from 
the  posterior  distribution  P(S\D,  T,  0o),  as  detailed  in  section  3.  In  generalized  EM 
algorithm,  we  seek  a  new  parameter  0  which  increases  the  expected  log  likelihood  of 
the  complete  data 

<5(0o,0)  =  EP(s\D,T,e0)  [logP(<S,-D|T,  0)]  (16) 

=  Ep(S\D,T,e0)  [log  P(D\S)  +  log  P(5|T,  0)]  (17) 

Notice  log  P( D\S)  does  not  depend  on  a,  A.  We  approximate  the  expectation  by  sam¬ 
ple  average 


<5(0o,©)  —  Const(Q)  +  Pp(s|_D,T,e0)  [log P( SIT,  ©)] 


M 


Const(Q)  +  —  log  P(S^\T,  0) 


And  we  find  the  gradients  w.r.t.  0  for  parameter  update 


M 


m— 1 
M  N 


m—  1  i— 1 

where  P(s\m'> \s^m'> . . .  s[™l ,  T,  0)  is  defined  in  (5).  The  gradients  ; 


are: 


log  P(si\Sl...  s^, T,Q)  =  \ 


if  Si  in  history 


da 


— vt —  if  Si  new 


(18) 

(19) 


(20) 

(21) 


(22) 


J^logP(si|si...si_i,T,0) 

(  _  E c  &w(ti,c) 

J  w(ti,Si)  Y.C  w(ti,c)+a 

I  Eo 

l,  Ec  w(ti,c)+a 


if  Si  in  history 
if  Si  new 


(23) 


6 


where 


53  W  ~ti)  =  Yl  e~X(t~u)  (24) 

i:ti<t,Si=c 

53  —  (i  —  *i)e_A(*_*°  (25) 

i:ti<t,Si=c 

We  then  take  a  gradient  step  in  the  M-step  of  the  generalized  EM  algorithm  to  improve 
the  log  likelihood. 

5  Experiments 

We  create  synthetic  datasets  which  have  explicit  time  dependency  between  instances, 
and  use  them  to  illustrate  the  time  sensitivity  of  tDPM  models. 

All  synthetic  datasets  have  n  =  100  instances.  We  first  create  the  time  stamps  of 
each  instances  by  sampling  from  a  Poisson  process.  In  particular,  the  interval  between 
two  consecutive  time  stamps  has  an  exponential  distribution  with  mean  I/7  =  1: 

p(u+ 1  -  ti )  =  (26) 

For  the  instance  d,  at  time  /:,,  its  state  s,;  is  sampled  from  the  conditional  distribution 
(5).  We  use  an  exponential  function  as  the  kernel 

k(t)  =  e~°-5t,t  >  0  (27) 

and  the  concentration  parameter  a  is  set  to  0.2.  This  emulates  the  situation  where  new 
clusters  are  created  from  time  to  time,  and  a  cluster  stays  alive  better  if  many  preceding 
instances  are  from  the  cluster. 

If  a  new  cluster  c  is  created,  we  sample  its  multinomial  distribution  9C  from  the 
base  distribution  Go-  The  base  distribution  Go  is  a  flat  Dirichlet  on  a  vocabulary  of 
size  three:  Go  ~  Dir(  1, 1, 1),  so  that  all  multinomials  are  equally  likely.  Finally  docu¬ 
ments  are  sampled  from  their  corresponding  multinomial  9,  where  all  documents  have 
the  same  length  |d|.  We  create  two  datasets  with  document  length  |d|  equals  20  and 
50  respectively,  with  everything  else  being  the  same.  Given  that  the  vocabulary  size  is 
3,  they  correspond  to  somewhat  hard  (less  words)  and  easy  (more  words)  datasets  re¬ 
spectively.  Figure  4  shows  time  vs.  cluster  plots  of  the  two  datasets.  Notice  documents 
from  the  same  cluster  tend  to  group  together  in  time,  which  fits  our  intuition  on  real 
world  problems  like  emails. 

During  evaluation,  the  input  to  various  algorithms  are  the  documents  di  and  their 
time  stamps  ti ,  and  the  goal  is  to  infer  the  clustering  st.  Notice  the  true  number  of 
clusters  is  not  given  to  the  algorithms. 

For  the  tDPM  model,  we  assume  we  know  the  true  base  distribution  Go  ~  Dir(  1, 1,1), 
concentration  parameter  a  =  0.2,  and  the  kernel  k(t)  =  e  We  run  the  Gibbs  sam¬ 
pler  with  initial  states  si  =  . . .  =  sn  =  1.  Each  MCMC  iteration  updates  s\, . . .  ,sn 
once,  and  thus  consists  of  n  Gibbs  steps.  We  ignore  the  burn-in  period  of  the  first  100 
MCMC  iterations,  and  then  take  out  a  sample  of  sn  every  11  iterations.  In  this 


w(t,c )  = 

d  .  , 

Vxw(t,c)  = 
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experiment  we  take  out  109  samples  altogether.  We  evaluate  the  performance  of  tDPM 
by  three  measures: 

1.  Number  of  clusters  discovered.  Notice  each  sample  si, . . . ,  sn  is  a  clustering  of 
the  data,  and  different  samples  may  have  different  number  of  clusters.  In  fact 
Figure  5(a,b)  shows  the  distribution  of  number  of  clusters  in  the  109  samples, 
on  the  hard  (|d|  =  20)  and  easy  (|d|  =  50)  synthetic  datasets  respectively.  The 
modes  are  at  12  and  15,  very  close  to  the  true  values  of  12  and  14  respectively. 

2.  Confusion  matrix.  One  way  to  combine  the  samples  with  possibly  different  num¬ 
ber  of  clusters  is  to  compute  the  n  x  n  confusion  matrix  M,  where  Mj7-  is  the 
probability  that  i,j  are  in  same  cluster.  This  can  be  easily  estimated  from  the 
109  samples  by  the  frequency  of  i,j  in  the  same  cluster.  Ideally  M  should  be 
similar  to  the  ‘true  confusion  matrix’  M* ,  defined  by  M*-  =  1  if  the  true  cluster 
has  label  .s,  =  Sj,  and  0  otherwise.  In  Figure  5(c,d)  we  plot  the  true  confusion 
matrices  M* .  Notice  we  sort  the  instances  by  their  true  cluster  for  better  visu¬ 
alization.  In  Figure  5(e,f)  we  plot  the  tDPM  confusion  matrices,  using  the  same 
order.  They  are  reasonably  similar. 

3.  Variation  of  Information.  We  compute  the  variation  of  information  measure  [3] 
between  the  true  clustering  and  each  sample  clustering.  We  list  the  mean  and 
standard  deviation  for  the  two  synthetic  datasets:  (hard)  0.9272  ±  0.1718,  (easy) 
0.1245  ±0.0911. 

We  compare  tDPM  to  a  standard  DPM  model,  by  using  a  step  function  as  the  ker¬ 
nel  k.  Again  we  assume  we  know  the  true  base  distribution  Go  ~  Dir(  1, 1, 1),  and 
concentration  parameter  a  =  0.2.  The  Gibbs  sampling  is  done  exactly  the  same  as  in 
tDPM.  We  find  that 

1.  Number  of  clusters  discovered.  Figure  6(a,b)  shows  the  distribution  of  number 
of  clusters  with  DPM.  DPM  discovers  fewer  clusters  than  tDPM.  The  modes  are 
at  6(or  7)  and  9  respectively,  and  the  true  values  are  12  and  14. 

2.  Confusion  matrix.  In  Figure  6(c,d)  we  plot  the  DPM  confusion  matrices.  Notice 
they  are  much  less  similar  to  the  true  matrices. 

3.  Variation  of  Information.  With  DPM  we  have  (hard)  1.8627  ±  0.1753,  (easy) 
0.6630  ±  0.1144.  This  means  the  sample  clusterings  are  significantly  farther 
away  from  the  true  clustering,  compared  to  tDPM. 

To  summarize,  tDPM  is  better  than  the  standard  DPM  model,  when  the  instances 
have  a  time  dependency. 

6  Discussions 

The  tDPM  model  is  a  way  to  take  time  into  consideration.  Notice  it  is  different  than 
simply  adding  time  as  a  new  feature  for  cluster. 


The  tDPM  is  not  time  reversible  nor  exchangeable  in  general.  This  is  different  from 
the  standard  DPM.  It  is  both  a  blessing  and  curse.  It  allows  for  the  modeling  of  time, 
but  at  the  expense  of  computation. 

There  are  many  ways  one  can  extend  the  tDPM  model  proposed  here: 

•  The  time  kernel  k  can  have  different  forms.  For  example,  different  clusters  can 
have  different  decay  rate  A.  More  interestingly,  k  can  even  be  periodic  to  model 
repetitive  emails  like  weekly  meeting  announcements. 

•  Currently  the  models  for  each  cluster  are  stationary  and  do  not  evolve  over  time. 
This  can  potentially  be  relaxed. 

•  One  can  have  a  generative  model  on  time  dependencies.  For  example  one  can  as¬ 
sume  a  Poisson  process  on  cluster,  and  then  a  non-homogeneous  Poisson  process 
on  the  documents  within  the  cluster. 
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Figure  4:  Two  synthetic  datasets  with  \d\  =  20  (left)  and  |d|  =  50  (right)  respectively. 
Top  row:  Time  stamps  ti  vs.  cluster  ID  .s,;  Middle  row:  the  cluster  multinomials  6C\ 
Bottom  row:  word  counts  for  each  document  dj . 
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Figure  5:  tDPM  results  on  the  hard  (|d|  =  20,  left)  and  easy  (|d|  =  50,  right)  synthetic 
datasets,  (a,  b)  Number  of  clusters  discovered  in  MCMC  samples;  (c,  d)  Confusion 
matrix  with  true  cluster  labels;  (e,  f)  Confusion  matrix  from  tDPM  MCMC  samples. 
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Figure  6:  Standard  DPM  results  on  the  hard  (|d|  =  20,  left)  and  easy  (|d|  =  50,  right) 
synthetic  datasets,  (a,  b)  Number  of  clusters  discovered  in  MCMC  samples;  (c,  d) 
Confusion  matrix  from  DPM  MCMC  samples. 
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